Exercise 1: Melbourne housing

  1. Read in a copy of the Melbourne housing data from Nick Tierney’s GitHub repo, which is a collation from the version at Kaggle. It’s fairly large, so let’s start simply and choose two suburbs to focus on. I recommend “South Yarra” and “Brighton”. (Note: there are a number of missing values. I recommend removing these before making plots.)
  2. Make a scatterplot matrix of price, rooms, bedroom2, bathroom, suburb, type. The plot will be easier to read if you put the numerical variables first, and then the categorical variables. What are the associations that can be seen?
  3. Subset the data to South Yarra only. Make an interactive scatterplot matrix of rooms, bedroom2, bathroom and price, coloured by type of property. There is a really high price property. Select this case, and determine what’s special about it: why did it sell for so much? Select the outlier in bedrooms and bathrooms, and examine the other characteristics of this property.

The property with the very high price has relatively modest characteristics! The property with 5 bathrooms for 3 bedrooms is fairly low priced. Maybe there is a mistake in the data and the bedrooms/bathrooms were swapped.
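The data preparation in parts 1 to 3 can be sketched roughly as follows. This is a minimal sketch, not the tutorial's own code: the column names (`Price`, `Rooms`, `Bedroom2`, `Bathroom`, `Suburb`, `Type`) are assumed to follow the Kaggle version of the data, `prep_housing()` is a hypothetical helper, and the small `toy` data frame is made up purely so the helper can be run without the real file.

```r
library(dplyr)

# Hypothetical helper: keep the chosen suburbs, put the numeric variables first,
# and drop rows with missing values. Column names are assumptions.
prep_housing <- function(housing, suburbs = c("South Yarra", "Brighton")) {
  housing %>%
    filter(Suburb %in% suburbs) %>%
    select(Price, Rooms, Bedroom2, Bathroom, Suburb, Type) %>%
    na.omit()
}

# Made-up rows for illustration only (not the Melbourne data)
toy <- data.frame(
  Price    = c(1.2e6, NA, 9e5, 7e5, 8e5),
  Rooms    = c(3, 2, 4, 2, 2),
  Bedroom2 = c(3, 2, 4, 2, 2),
  Bathroom = c(2, 1, NA, 1, 1),
  Suburb   = c("South Yarra", "Brighton", "Brighton", "Richmond", "Brighton"),
  Type     = c("h", "u", "h", "u", "u")
)

two_suburbs <- prep_housing(toy)
```

On the real data, passing the result to `GGally::ggpairs()` would draw the scatterplot matrix with the numeric variables first and the categorical variables last.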

  4. Examine price vs rooms coloured by bathrooms, faceted by suburb and type, and with a linear model overlaid. What do you learn about average house prices relative to number of rooms and number of bathrooms, for the different property types and suburbs? (Remove the one really high priced property first, because it affects what we can learn about the rest of the data.)
  5. If we throw all the neighbourhoods in together to analyse price and property characteristics, what pitfall might we encounter?

Simpson’s paradox. Suburb is an important factor in property price. The relationship between price and other characteristics is likely to differ by suburb, and this information will be lost when the suburbs are pooled.
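A small numerical illustration of the pitfall, using made-up prices for two hypothetical suburbs (not the Melbourne data): within each suburb price rises with rooms, yet the pooled fit slopes downwards because the expensive suburb has the smaller properties.

```r
# Made-up prices (in $1000s) for two hypothetical suburbs
toy <- data.frame(
  suburb = rep(c("A", "B"), each = 3),
  rooms  = c(1, 2, 3, 4, 5, 6),
  price  = c(1100, 1200, 1300, 600, 700, 800)
)

# Slope of price on rooms when suburb is ignored
pooled_slope <- coef(lm(price ~ rooms, data = toy))[["rooms"]]

# Slope of price on rooms fitted separately within each suburb
within_slopes <- sapply(split(toy, toy$suburb), function(d) {
  coef(lm(price ~ rooms, data = d))[["rooms"]]
})

pooled_slope   # negative
within_slopes  # positive in both suburbs
```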

Exercise 2: Olive oils

Following on from the olive oils example from class, we will explore the oils from the south here.

  1. Grab a copy of the data, and subset to contain just the samples from region = south (1), and also drop eicosenoic acid, because there is nothing useful about this variable for the southern oils.
  2. Looking only at areas 1-3, that is, excluding Sicily:
    • Make an interactive parallel coordinate plot of the fatty acids (except eicosenoic), where the lines are coloured by area. (Code is provided; it is a bit tricky, but worth it!)
    • Look at the data in a tour.
    • Describe what you learn about differences between the three areas, whether these are separated. Are some variables more useful for distinguishing the three areas? Are there any outliers?

The three areas are quite different on a combination of palmitoleic, oleic, palmitic, and linoleic acids. There are some possible outliers, which can be found by selecting various lines and noticing that they have a different trend from the other lines.

The three areas are quite distinct. We could distinguish the growing area of the olive oils by examining the fatty acid composition.

  3. Re-do part 2 with Sicily included. Explain what you learn about Sicily relative to the other areas.

Sicily overlaps with two of the other three areas. Most of its samples are not distinguishable from the samples of those two areas.

  4. Do some googling. What can you find out about Sicilian olive oils? Are they higher in value? Does Sicily even grow olives, or does it use olives from neighbouring areas?

Not sure what the reason is! Maybe growing conditions in some fields are similar in the three areas. Maybe Sicily imports olives to make the oils from neighbouring areas.

Exercise 3: Baker field soils

  1. Make density plots of the soil variables in the Baker field corn yield data. Choose an appropriate transformation to symmetrise the distribution.

Many of the variables have a right-skewed distribution. For these, a square root or log transformation would be appropriate. Ca has a severe right-skew, so a log-log transformation might be required. The code below shows the transformations made.

library(dplyr)

# Transform each soil variable to reduce its right skew
corn_trf <- corn %>%
  mutate(B = log10(B),
         Ca = log10(log10(Ca)),
         Cu = sqrt(Cu),
         K = log10(K),
         Mg = sqrt(Mg),
         Mn = sqrt(Mn),
         Na = sqrt(Na),
         P = log10(P),
         Zn = sqrt(Zn))

The transformed data has mostly symmetric, unimodal distributions now. Making these transformations is useful when considering the relationship between variables. If each variable is well-spread then the association is measured using most of the points, but if you try to assess the association between skewed distributions, the judgement is based on just a handful of observations.
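One way to check numerically that a transformation has symmetrised a variable is to compare the sample skewness before and after. The sketch below uses illustrative values that are evenly spaced on the log10 scale (not the corn data), and `skewness()` is a hypothetical helper using one common definition, the mean of the cubed standardised values.

```r
# Sample skewness: mean of the cubed standardised values (one common definition)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Illustrative right-skewed values, evenly spaced on the log10 scale
x <- 10^seq(0, 3, length.out = 50)

skew_raw <- skewness(x)         # clearly positive: long right tail
skew_log <- skewness(log10(x))  # essentially zero: symmetric after the transformation
```

A skewness close to zero after transforming suggests the choice was reasonable; a density plot remains the better check, because skewness alone can hide multimodality.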

  2. Make a scatterplot matrix. If you can make an interactive one, that would be extra special. Describe the relationships between pairs of variables.

  3. Make a grand tour of soil variables. Describe the different patterns that you see in various projections. Is there clustering? Is there linear dependence? Non-linear dependence? Outliers? For any structure that you see, determine which variables contribute to it, and make plots of these variables (or check the scatterplot matrix) to check whether the pattern is visible there too.

library(tourr)

# Grand tour of the transformed soil variables
set.seed(11111)
animate_xy(corn_trf[,4:13], fps=10, half_range=1.0)

# Or use this: save a fixed tour path and replay it
# (play_tour_path() is assumed to come from the spinifex package)
corn_trf_stdd <- tourr::rescale(corn_trf[,4:13])
set.seed(11111)
tpath <- tourr::save_history(corn_trf_stdd,
                             tour_path = tourr::grand_tour(),
                             max_bases = 20)
play_tour_path(tour_path = tpath, data = corn_trf_stdd)

The data is reasonably regular. There are some linear dependencies visible, potentially a few outliers, and possibly some slight nonlinear association.

Exercise 4: Exam marks

There is a dataset “mathmarks” in the SMPracticals package, which has marks out of 100 for 88 students. It is interesting to note that all students had marks for all tests, which makes one wonder whether marks for students who missed a test were dropped. Mechanics and vectors were closed book exams, and the others were open book.

  1. Make a side-by-side boxplot of the test scores. What do you learn about the test scores on the different subjects?

There is some difference in median and IQR across exams. Students tended to do better on vectors, algebra and analysis and worse on mechanics and statistics. There is no indication that the open book exams produced better scores than the closed book exams.

The distributions are fairly symmetric. The spread is similar except for algebra, where the scores are more concentrated. There is one quite high score on algebra, and two low scores. Mechanics and vectors each have one unusually low score.

  2. Make a scatterplot matrix, even better if it is interactive. Describe the relationships between the tests. Is there something different about the open book vs closed book scores?

Generally we can see positive linear association. Algebra and analysis are most related. Algebra is pretty closely related to all other test scores. Mechanics is only weakly related to statistics. There are a couple of outliers: low scores on algebra, and also two students who did badly on vectors but quite well on analysis. There looks to be a barrier at 80 for the statistics test; maybe this was the maximum possible score. Analysis and algebra, and also analysis and vectors, have what I call a “comet” distribution: there is a tight concentration of high scores, and a big spread among lower scores. I see this a lot in statistics test scores.

  3. Make an interactive parallel coordinate plot. Are there some students who have done consistently well on all tests? Consistently badly on all tests? Badly on some but better on others?

There is one student who did much worse on the closed book exams. There is one student who did really well on the closed book vectors exam, but badly on all other exams. There are a few students who have done really well on all exams. Generally, a student performs at about the same relative level across all the tests. There appears to be some slight bimodality in mechanics, analysis and statistics, which is also visible in the density plots in the splom.

Exercise 5: Knowledge and resources

The “vcdExtra” package contains a dataset “Dyke” about how 1729 survey respondents’ knowledge of cancer depended on whether they listened to the radio, read newspapers, did solid reading, or attended lectures.

  1. Make separate bar charts for each of the explanatory variables, with bars filled by the response variable Knowledge. What do you learn?

  2. Make a 100% bar chart of Newspaper, with Knowledge mapped to fill, and faceted by Reading. What do you learn about the relative proportions in the groups?

  3. Make a doubledecker plot of the data. What combination of factors leads to the highest level of knowledge about cancer? What combination leads to the lowest?

The most knowledgeable people read, listen to the radio, attend lectures and read newspapers, as might be expected. There are very few people in this category, though. On each comparison the Yes group is more knowledgeable than the No group. The most populous group is the No to all combination. This is pretty shocking because it is the least knowledgeable about cancer. The second most common group is people who read newspapers, and read generally, and these are relatively knowledgeable about cancer.
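The 100% bar chart and the doubledecker plot could be produced along these lines. This is a sketch under assumptions: it takes `Dyke` to be a five-way contingency table with dimensions named `Knowledge`, `Reading`, `Radio`, `Lectures` and `Newspaper`, and it relies on `vcdExtra` attaching `vcd` for `doubledecker()`.

```r
library(vcdExtra)  # assumed to provide the Dyke table and to attach vcd
library(ggplot2)

data("Dyke", package = "vcdExtra")

# One row per combination of factors, with the cell count in Freq
dyke_df <- as.data.frame(Dyke)

# 100% bar chart: proportion of Knowledge within Newspaper, faceted by Reading
ggplot(dyke_df, aes(x = Newspaper, y = Freq, fill = Knowledge)) +
  geom_col(position = "fill") +
  facet_wrap(~ Reading, labeller = label_both) +
  ylab("Proportion")

# Doubledecker plot: Knowledge as the response, split by all four explanatory factors
doubledecker(Knowledge ~ Reading + Radio + Lectures + Newspaper, data = Dyke)
```

Using `position = "fill"` rescales each bar to height 1, so the comparison is of proportions rather than counts; the doubledecker plot adds the counts back in through the bar widths.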

Exercise 6: Parkinsons

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson’s disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals (“name” column). The main aim of the data is to discriminate healthy people from those with PD, according to the “status” column, which is set to 0 for healthy and 1 for PD.

The data is available at the UCI Machine Learning Repository in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the patient’s name is given in the first column. There are 24 variables in the file, including the person’s name in column 1.

The data are originally analysed in: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), ‘Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease’, IEEE Transactions on Biomedical Engineering (to appear).

  1. Compute the scagnostics for all pairs of variables, except for name.

  2. Sort the scagnostics, show the top 10 on (i) Monotonic (ii) Clumpy (iii) Your choice, and plot the pair of variables with the highest values.

##                                  Monotonic
## Shimmer:APQ3 vs Shimmer:DDA      0.9999998
## MDVP:RAP vs Jitter:DDP           0.9999993
## MDVP:Shimmer vs MDVP:Shimmer(dB) 0.9823538
## MDVP:Shimmer vs Shimmer:DDA      0.9782364
## MDVP:Shimmer vs Shimmer:APQ3     0.9781855
## spread1 vs PPE                   0.9650254

There are some variables that are exactly the same, up to a scale factor.

  3. Make an interactive scatterplot matrix. Browse over it to choose other interesting pairs of variables and make the plots.

library(dplyr)
library(plotly)

# Create an interactive splom of the scagnostics;
# hovering over a point shows the pair of variables it belongs to
s %>%
  mutate(id = rownames(s)) %>%
  plot_ly() %>%
  add_trace(
    type = 'splom',
    dimensions = list(
      list(label='Outlying', values=~Outlying),
      list(label='Skewed', values=~Skewed),
      list(label='Clumpy', values=~Clumpy),
      list(label='Sparse', values=~Sparse),
      list(label='Striated', values=~Striated),
      list(label='Convex', values=~Convex),
      list(label='Skinny', values=~Skinny),
      list(label='Stringy', values=~Stringy),
      list(label='Monotonic', values=~Monotonic)
    ),
    text=~id
  )

This pair of variables has both positive association, and outliers.

  4. The scagnostics help us to find interesting associations between pairs of variables. However, the problem here is to detect differences between Parkinson’s patients and normal patients. How would you go about that? Think about some ideas along the lines of scagnostics, but look for differences between the two groups.

Generally we are looking for variables where the differences between the Parkinson’s and normal patients are big. “Big” needs to be measured relative to the variance of each group. Doing a two-sample t-test for each variable is one approach. Here, I’ve computed the median for each group of patients and compared the difference in medians relative to the pooled standard deviation of the two groups. There is no reason for this other than to do something a little less standard.
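A minimal sketch of such a measure is given below. It is not the original code: `median_diff_score()` is a hypothetical helper, the pooled standard deviation uses the usual two-sample formula, and the `toy` data frame is made up purely to exercise the function (it is not the Parkinson’s data). The `status` coding of 0 for healthy and 1 for PD follows the description above.

```r
# Absolute difference in group medians, divided by the pooled standard deviation
median_diff_score <- function(x, status) {
  x0 <- x[status == 0]
  x1 <- x[status == 1]
  n0 <- length(x0)
  n1 <- length(x1)
  pooled_sd <- sqrt(((n0 - 1) * var(x0) + (n1 - 1) * var(x1)) / (n0 + n1 - 2))
  abs(median(x1) - median(x0)) / pooled_sd
}

# Made-up values for illustration only
toy <- data.frame(
  status = c(0, 0, 0, 1, 1, 1, 1, 1),
  v1     = c(1, 2, 3, 4, 5, 6, 7, 8),  # groups clearly separated
  v2     = c(1, 2, 3, 1, 2, 3, 2, 2)   # same median in both groups
)

# Score every measurement column, then rank from most to least separated
scores <- sapply(toy[c("v1", "v2")], median_diff_score, status = toy$status)
sort(scores, decreasing = TRUE)
```

On the real data the same call would be applied to every voice measure, and the variables with the largest scores plotted by group to confirm the difference visually.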